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The stochastic block model (SBM) provides a popular framework 
for modeling community structures in networks. However, more at¬ 
tention has been devoted to problems concerning estimating the la¬ 
tent node labels and the model parameters than the issue of choos¬ 
ing the number of blocks. We consider an approach based on the log 
likelihood ratio statistic and analyze its asymptotic properties un¬ 
der model misspecification. We show the limiting distribution of the 
statistic in the case of underfitting is normal and obtain its conver¬ 
gence rate in the case of overfitting. These conclusions remain valid 
when the average degree grows at a polylog rate. The results enable 
us to derive the correct order of the penalty term for model complex¬ 
ity and arrive at a likelihood-based model selection criterion that is 
asymptotically consistent. Our analysis can also be extended to a 
degree-corrected block model (DCSBM). In practice, the likelihood 
function can be estimated using more computationally efficient vari¬ 
ational methods or consistent label estimation algorithms, allowing 
the criterion to be applied to large networks. 


1. Introduction. Network modeling has attracted increasing research 
attention in the past few decades as the amount of data on complex systems 
accumulates at an unprecedented rate. Many complex systems in science and 
nature consist of interacting individual components which can be represented 
as nodes with connecting edges in a network. Network modeling has found 
numerous applications in studying friendship networks in sociology, Internet 
traffic in information technology, predator-prey interactions in ecology, and 
protein-protein interactions and gene regulatory mechanisms in molecular 
biology. 

One prominent feature of many of these networks is the presence of com¬ 
munities, where groups of nodes exhibit high internal connectivity. Commu¬ 
nities provide a natural division of the network into subunits with certain 
traits. In social networks, they often arise based on people’s common inter¬ 
ests and geographic locations. The World Wide Web forms communities or 
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hubs based on the content of the web pages. In gene networks, communi¬ 
ties correspond to genes with related functional groupings, many of which 
can act in the same biological pathway. Numerous heuristic algorithms have 
been proposed for detecting communities. However, a generative model is 
needed to study the problem from a theoretical perspective. 

The stochastic block model (SBM), proposed by Holland, Laskey and 
Leinhardt (1983) in social science, is one of the simplest random graph 
models incorporating community structures. It assigns each node a latent 
discrete block variable and the connectivity levels between nodes are de¬ 
termined by their block memberships. In practice, this model sometimes 
oversimplifies the structures of real networks and other variants have been 
proposed, including the degree-corrected SBM (DCSBM) (Karrer and New¬ 
man, 2011) relaxing the within-block degree homogeneity constraint and 
overlapping SBM (Airoldi et al., 2008) allowing a node to be in multiple 
blocks. These models have been applied to model real networks in social sci¬ 
ence and biology (Bickel and Chen, 2009; Daudin, Picard and Robin, 2008; 
Airoldi et al., 2008; Karrer and Newman, 2011). 

Much research effort has been devoted to the problems of estimating the 
latent block memberships and model parameters of a SBM, including mod¬ 
ularity (Newman, 2006a) and likelihood maximization (Bickel and Chen, 
2009; Amini et al., 2013), variational methods (Daudin, Picard and Robin, 
2008; Latouche, Birmele and Ambroise, 2012), spectral clustering (Rohe, 
Chatterjee and Yu, 2011; Fishkind et al., 2013), belief propagation (Decelle 
et al., 2011) to name but a few. The asymptotic properties of some of these 
methods have also been studied (Bickel and Chen, 2009; Rohe, Chatter¬ 
jee and Yu, 2011; Celisse et al., 2012; Bickel et al., 2013). However, these 
methods require knowing (or knowing at least a suitable range for) K , the 
number of blocks, a priori. Less attention has been paid to the problem of 
selecting K. 

For general networks this problem corresponds to the issue of determin¬ 
ing the number of communities, which remains a challenging open problem. 
Recursive approaches have been adopted to extract (Zhao, Levina and Zhu, 
2011) or divide (Bickel and Sarkar, 2013) one community sequentially, while 
using optimization strategies or hypothesis testing to decide whether the 
process should be stopped at one stage. A more general sequential test for 
comparing a fitted SBM against alternative models with finer structures is 
proposed in Lei (2014). Conceptually these approaches are more appealing 
for networks with a hierarchical structure. In other cases, it would be more 
desirable to be able to compare different community numbers directly. A 
few likelihood-based model selection criteria have been proposed (Daudin, 
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Picard and Robin, 2008; Latouche, Birmele and Ambroise, 2012; Saldana, 
Yu and Feng, 2014). From an information-theoretic perspective, Peixoto 
(2013) proposed a criterion based on minimum length description. These ap¬ 
proaches circumvent the difficulty of analyzing the likelihood directly by us¬ 
ing variational approximations or assuming the node labels are fixed and us¬ 
ing plug-in estimates obtained from other inference algorithms. Furthermore, 
the asymptotic studies of these criteria examining their large-sample perfor¬ 
mance remain incomplete. Empirically, a network cross-validation method 
has been investigated in Chen and Lei (2014). More recently, Le and Lev¬ 
ina (2015) proposed a method based on analyzing the spectral properties 
of graph operators, including the non-backtracking matrix and the Bethe 
Hessian matrix. 

In this paper, we directly address the challenges involved in analyzing 
the asymptotic distribution of the maximum log likelihood function under 
model misspecification. We show the log likelihood ratio statistic is asymp¬ 
totically normal in the case of underfitting. Although obtaining an explicit 
asymptotic distribution of the statistic in the case of overfitting is much more 
challenging, we have still derived its order of convergence and subsequently 
shown these two cases of misspecification can be separated with probabil¬ 
ity tending to one. We thus propose a model selection criterion taking the 
form of a penalized likelihood and show it is asymptotically consistent in the 
regime where network average degree grows at a polylog rate. In Section 2, 
we first derive our main results under the regular SBM assumptions and then 
outline how the arguments can be extended to a DCSBM. Computationally 
the likelihood can be approximated with variational algorithms or consistent 
label estimation algorithms without affecting the asymptotic consistency of 
the criterion. We demonstrate the effectiveness of our method by comparing 
its performance with other model selection approaches on simulated and real 
networks in Sections 3 and 4. 

2. Results. 

2.1. Preliminaries. A SBM with K blocks on n nodes is defined as fol¬ 
lows. A vector of latent labels Z = (Z ±,..., Z n ) is generated with Z^ taking 
integer values from [K] = {1,..., K} governed by a multinomial distribution 
with parameters n = (7Ti, 7r2,..., ttr)- Given Zi = a, Zj = b, an adjacency 
matrix A is generated with 

A,j\(Zi = a, Zj = b) ~ Bernoulli(IL a)b ), i / j. 

We consider a symmetric A with zero diagonal entries corresponding to 
an undirected graph, although our arguments generalize easily to directed 
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graphs. H is a K x K symmetric matrix describing the connectivities within 
and between blocks. We denote the model parameters 9 = (7 r, H ) and let 
©A' be the parameter space of a A"-block model, 

A 

©A- ={9 | vr G (0, 1 ) K , £ 7T 0 = 1, A e (0, l) XxA '}. 

a=l 


Throughout the paper, 6* = (ir*,H*) will denote the true generative pa¬ 
rameter giving rise to an observed A. We will further parametrize H* by 
H* = p n S *, where the degree density p n may be f2(l) or going to zero 
at a rate np n /\ogn -A 00 . We assume 9* G &k and H* has no identical 
columns, meaning the underlying model has K blocks and it is identifi¬ 
able in the sense that it cannot be further collapsed to a smaller model. 
z = (z \,... , z n ) G [K'] n represents another set of labels under a K '-block 
model with K l not necessarily equaling K. g(A ; 9) is the likelihood function 
describing the distribution of A with parameter 9 G 0 k 1 and can be written 
as the sum of the complete likelihood function f(z, A ; 9) associated with the 
labels z G [. K ’] n : 

( 2 . 1 ) g{A-0)= £ f(z,A;6), 

z€[I<'] n 


where f(z, A ; 9) takes the form 


A' 


A' A' 


1/2 


\n a ,b(z)-Oa,b( z ) 


f( Z , a- e )=m < a(z) n n a - 

\a=l / \a=lb=l 

with count statistics 

n n 

n a(z) = ^ I (Zi = a), n at b(z ) = ^2^2 Hzi = a , zj = b ) 
i=l i=l j^i 

n 

Oa,b(z') = ^ ^ ^ ^ = Zj = j. 

i=l jz/zi 


g and / are invariant with respect to a permutation on the block labels, 
r : [K 1 ] -A [AT'], and its corresponding permutations on the node labels z 
and the parameters 9. Furthermore, let R(z) be the K' x K confusion matrix 
whose ( k , u)-th entry is 

n 

Rk,a(z, Z) = rT 1 Y2 K z i = k,Zi = a). 

i=1 


( 2 . 2 ) 
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We take a likelihood-based approach toward model selection and first 
investigate whether different model choices can be separated using the log 
likelihood ratio 


(2.3) 


Lk.k' 


log SUPege ^' 9<kA] 6 1 
su P<9e0x 4; 9 ) ' 


Here the comparison is made between the correct it-block model and fitting 
a misspecified it'-block model. 

In the following sections, we analyze the asymptotic distribution of Lrri 
for K' I\. The main focus of analysis lies in handling the sum in (2.1) 
which contains an exponential number of terms. It has been shown in Bickel 
et al. (2013) that when 6 £ Qr, sup 0 g @ g(A;9) is essentially equivalent to 
maximizing the complete likelihood corresponding to the correct labels Z, 
sup eg @ A . f(Z,A;9). In the next section, we handle the case of underfitting 
and derive the asymptotic distribution of Lr r /. 


2.2. Underfitting. We start by considering K' = K — 1. Intuitively, a 
(it — l)-block model can be obtained by merging blocks in a iv-block model. 
More specifically, given the correct labels Z £ [ K] n and the corresponding 
block proportions p = (p\ ,... ,pr ), p a = n a (Z)/n, we define a merging op¬ 
eration U a) b(H* , p) which combines blocks a and b in H* by taking weighted 
averages with proportions in p. For example, for H = Ur- \ ,i<(H*,p), 


Him = H 


l,k 


H, 


l,K—l — 


(2.4) ii/v —l,A—1 = 


for 1 < l, k < K — 2; 

PlPR—iH ^ K1 + pip K H* K 
PlPR -1 + PIPK 
Pl <—1 Hr_i r_i + 2pr_iPrH^_ 1 k + PrH^. 


for 1 < l < K — 2; 


K 


P K _i + 2pR-\p K + pfi 


A schematic representation of H is given in Figure 1. 

For consistency, when merging two blocks (a, b) with b > a, the new 
merged block will be relabeled a and all the blocks c with c > b will be 
relabeled c — 1. Using this scheme, we also obtain the merged node labels 
U atb (Z) and merged proportions U atb (p) with [U atb (p)\ a = p a +Pb- 

Constraining the parameters to a smaller model results in a suboptimal 
likelihood and its distance from the likelihood associated with the cor¬ 
rect model can be measured by the Kullback-Leibler divergence, denoted 
Drl{- HO- Let 


71 (x) = xlogx + (1 — x) log(l — x) 






6 


Y.X.R. WANG AND P.J. BICKEL 



1 

• * * K- 1 

K 

1 


— 

— 



— 

— 

j 


; 

K - 1 

l 

l 


\ 


K 

t 

1 


A 

\ 


Fig 1: A schematic representation of how H* is merged to give H = 
Uk-i,k{H* ,p). The green area contains unchanged parameters and the ar¬ 
rows indicate where mergings occur. 


72 (x) = a; logic — x. 


and define 

K—l 

(2.5) Di(a,b)= ^[^a I 6(7r*)]fc[^o ) 6(7r*)]/7i([^a,6(ff*,7r*)] fc) i) 

k,l=l 

When p = tt* and treating the labels Z as fixed parameters, denote Pa\z,h* 
the probability distribution of A. Then the information loss incurred by the 
merging operation U a ,b can be measured by 


(2.6) Dkl (p. 


A\Z,H* 
K 


A\U a>b {Z),U aib {H* ,7T*), 

Y.c,d= i - Di(a, b) J + 0(n), for p n = 0 ( 1 ); 

Efd=1 «72 ( H c,d) ~ ^ 2 ( 0 , 6)1 + 0(n 2 pl), for pn -A 0. 


Thus an optimal merging minimizing -Dkl is essentially equivalent to max¬ 
imizing Dj(a, b). 

We assume the following holds for 8*: 


Assumption 2.1. A unique maximum, exists for max^ a b ^ Di(a,b). 

This assumption is more of a notational convenience than necessity. From 
now on without loss of generality assume the maximum is achieved at a = 
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K — 1 and b = K, and denote H' = Uk-i,k(H*, n*), S' = H'/p n and 
Z' = Uk~i,k{Z). We also assume H' is identifiable in the sense that 

Assumption 2.2. S' has no identical columns. 


Thus the merged model cannot be collapsed further to a smaller model. 
The next lemma argues sup 0e @ K1 g(A ; 6) is essentially dominated by the 
complete likelihood associated with the optimal merging. 


Lemma 2.3. Let S(z ) be the set of labels which are equivalent up to a 
permutation r, 5(z) = { t(z ) | t : [K — 1] —>• [ K — 1]}. Then 

(2.7) ^2 SU P /(A-4; 60= sup f(Z',A;6)o P ( 1). 

z^S(Z') 0ee, K-i Oee K -t 


The proof is shown in the Appendix. 

This lemma provides a tractable bound on sup ee 0 K1 g(A\ 6), allowing 
the rest of the analysis to be carried out by usual Taylor expansion. 

Define 


mon = 2 


K 


Di(K- 1,K)~ J2 (Kd) 

c,d= 1 

i 


Ml?*) = MI (o*) + - {{n*K- 1 + tt*k) log(vr^_ 1 + ir* K ) - tt^ log7r| c _ 1 - tt* k logvr^} 


Upon merging blocks K — 1 and K, denote u{a) as the new block label of 
block a, and define di(a,b ) such that 


ii («.« = Ki, fog + (!-<„) ^g 


( 2 . 8 ) d 2 (a,b) = S* tb log 


TT* 

H a,b 

Q f 

°u(a),u(b) 


-H 


a,b 


S: 


a,b 


+ (S'u(a),u{b) ~ $a,b) 


The following theorem gives the asymptotic distribution of Lk,k- i> the 
proof of which is shown in the Appendix. 


Theorem 2.4. Suppose the underlying model parameter generating A is 
8* = ( n*,H *) G 0 k , then Lk,k -i is asymptotically normal with 

n~ 3/2 L KtK _i - y/nm(8*) iV(0, of(0*)), if p n = D(l); 

(2.9) p~ l n~ 3/2 L K ^ K -i - p~ 1 Vnp 2 (0*) iV( 0 , of ( 0 *)), if p n ~* 0 . 
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when Assumptions 2.1 and 2.2 hold. Let £(7 t*) be the covariance matrix of 
a multinomial (tt*) distribution, that is H a ,a( 7r*) = 7r*(l — n*) and E ai b = 
~ 7T a 7r b f or a ^ b. The variance of (9*) is given by J T J{9*) for 
i = 1,2, where J(6*) = ( Jb{0*))i<b<K, 

j (q*\ = i^K-idiib, K - 1) +TT* K di(b,K) for 1 < b < K - 2 
1 Kdi (a,b) + Tr%di ( b , 6) /or b = K - 1, K. 

Remark 2.5. (i) A special case occurs for K = 2, = tt 1 = 2 - 

In this case, of (9*) = 0 and Pn 1 n~ 3 ^ 2 LK,K-i converges to its asymptotic 
mean. In general, for homogeneous block models with H* a = h and H* b = g 
for a / b, J(9*) simplifies to Jb(0*) = 0 for b < K — 2, Jk~i{9*) = 

7 r* K _ } di(K, K)+ir* K di(K- 1, K), and J K (d*) = ir* K _^di(K- 1, K)+7r^d i (iC, AT) 

(ii) In general, underfitting a A' - < K model will lead to the same type 
of limiting distribution under conditions similar to Assumptions 2.1 and 2.2, 
assuming the uniqueness of the optimal merging scheme and identifiability 
after merging. That is, 

(2.10) p~ 1 n~ 3/2 L K K - - pf l \/n,p N (0, cr 2 ) 

for some mean p ~ Cp n and variance o 2 . The proof will be similar but 
involve more tedious descriptions of how various merges can occur. 

(iii) The asymptotic distributions derived under the null distribution of 
a A'-block model suggest one might consider performing hypothesis test¬ 
ing directly to compare against an alternative simpler model. However, the 
asymptotic means depend on the true parameters, and its maximum likeli¬ 
hood estimate converges only at the rate y/n (Bickel et ah, 2013). 

(iv) Without Assumptions 2.1 and 2.2, it is easy to show 

(2.11) L kk ~ < — Llp(n 2 p n ), 

where f2(-) denotes asymptotic lower bound, using the method in proving 
Theorem 2.7. 

2.3. Overfitting. In the case of overfitting a A" + -block model with K + > 
AT, deriving the asymptotic distribution of Lj< k+ is niuch more challenging. 
Intuitively, embedding a AT-block model in a larger model can be achieved 
by appropriately splitting the labels Z and there are an exponential number 
of possible splits. We first show a result analogous to Lemma 2.3. However, 
the number of summands involved in sup^gg g(A; 9) remains exponential 
this time. 
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Recall that for z € [K + ] n , R(z, Z) is the K + x K confusion matrix. We 
first define a subset V^+ G [K + ] n such that 

Vj{+ = {z G [K + ] n | there is at most one nonzero entry in every row of R(z, 

V K + is obtained by splitting of Z such that every block in z is always a subset 
of an existing block in Z. The next lemma shows it suffices to consider only 
the subclass of labels Vp+ in the sum g(A\ 6 ), the proof of which is given in 
the Appendix. 

Lemma 2.6. Suppose 9* G @k, then 

Y sup f(z,A;9) = (1 + o P (l)) Y SU P f(z,A-,6). 

ze[K+]™ de0 K+ zeV K+ eee K+ 


The lemma does not provide a direct simplification of the sum and sug¬ 
gests the reason why obtaining an asymptotic distribution for L K K + is dif¬ 
ficult. On the other hand, with appropriate concentration we can still derive 
the asymptotic order of the statistic. 

Theorem 2.7. Suppose 6* G @k, then overfitting by a K + -block model 
with K + > K gives L K K + = 0 P (n 3 / 2 pi /2 ). 

The proof is provided in the Appendix. 

2.4. Model selection. The results in the previous sections lead us to con¬ 
struct a penalized likelihood criterion for selecting the optimal block number. 
The criterion is consistent in the sense that asymptotically it chooses the 
correct K with probability one. Define 

(2.12) fi(K') = sup log g(A;9) - N K ,B n , 

0e© A / 

where B n gives the order of the penalty term, and Npr is a strictly increasing 
sequence indexed by K' describing the complexity of the model. The optimal 
K 0 is such that 

(2.13) Kq = argma xfi(K'). 

K' 

Corollary 2.8. For K' < K, setting B n = o{n 2 p n ), 


(2.14) 


We*{P{K')<fi{K))^l. 
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For K' > K, setting B n such that B n n 3 ^ 2 p n 1 ^ 2 -A oo, 
(2.15) F e *(p(K') < P(K)) Al. 


Proof. For K' < K, generalizing Theorem 2.4, 


(2.16) 


P e * (P(K') < 0(K)) 


n 


-3/2-1 


Pn log 


su Pege K / g(^; °) 
su Pee©x sO 4 ; 0) 





< (n k , - n k )-^~ 

n 6 / z p n 



"Al) 


since B n = o(n 2 p n ) and —p~ 3 p > C(9*) for some positive constant depend¬ 
ing on 8*. In general the same conclusion holds by Remark 2.5 (iii). 

For K' > K, using Theorem 2.7, 


P„. (fJ(K') < f)(K j) 

\rF/ 2 pJ sn Peee K ff(A; 0) 


< (N K i — N k )- 


B r 


n 3 / 2 pl/ 2 


(2.17) -Al, 


when B r 


n 


-vyA 2 


oo. 


□ 


Since the ratio of the upper bound n 2 p n and the lower bound n?! 2 pj 2 
tends to infinity, such a sequence B n exists. Choosing B n in this interval, 
we have Kq = K with probability tending to 1. However, we also note that 
for finite cases with moderate-sized n, xfnp ~ 1 p, in (2.16) is small, making it 
easy to over penalize with large B n . At the same time, the lower bound in 
Theorem 2.7 is not tight and can be refined further. 

We further assume the following holds for tractable approximation. 


Assumption 2.9. The maximum is achieved in the set Mk+ = {z E 
Vp-+ I Rfc(z) > for all k, for some e > 0 , }. 


Assumption 2.9 assumes the maximum can only be achieved on a loosely 
balanced block design. The assumption and Lemma 2.6 imply it remains to 
analyze the order of rnax z& v K+ sup 0 g 0 A , + log f(z, A; 9). The following theo¬ 
rem shows the order of L K K + can be refined to Op(n). The details can be 
found in the Appendix. 
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Theorem 2.10. Under Assumption 2.9, L KK + is of order Op(n ) for 


K+ > K. 


It follows then choosing B n growing slightly faster than n will ensure con¬ 
sistency in the sense described in Corollary 2.8. Thus we choose a penalized 
likelihood of the following form, 


(2.18) P(K') = sup log g{A\ 0) - X ■ — ^ + ^ nlogn, 


e&e K i 


where the complexity term corresponds to the number of parameters in the 
edge probability matrix and the constant A is a tuning parameter. Similar 
to many BIC-type criteria, choosing the tuning parameter is a challenging 
problem even though it does not affect the asymptotic properties. We discuss 
this problem in Section 3. 

2.5. Extension to a degree-corrected stochastic block model. In practice, 
the SBM often oversimplifies the community structures by assuming all the 
nodes within a block have the same expected degree, thereby excluding net¬ 
works with “hub” nodes and other possible degree variations within blocks. 
To address this limitation, Karrer and Newman (2011) proposed the degree- 
corrected stochastic block model (DCSBM) by setting 


(2.19) 


E (Aij | Z, lo) = ujiUjjHz^Zj , i / j 


where u = (wi,..., ui n ) is the set of node degree parameters with some 
identifiability constraint. 

As before, Z ~ Multinomial(7r). We also treat w as a latent variable and 
assume uii \ Z ~ n k (Z) ■ Dirichlet(l) for Z % = k so that u satisfies the 
identifiability constraint = n k for every k. Similar to Karrer and 

Newman (2011), we replace the Bernoulli likelihood by the Poisson likelihood 
and assume A,^ ~ Poisson {uj'f Hz^z, /2) to simplify derivation. As noted in 
Karrer and Newman (2011) and Zhao et al. (2012), sparse networks are well 
approximated by the Poisson distribution and little difference was found in 
practice between the two choices. The assumption on the diagonal entries 
also does not change the asymptotic results. Therefore given (Z,u), the log 
conditional likelihood of A is (up to a constant) 



= ^2 A hi lo S w * + \ *22 (°kA Z ) lo S H k,i ~ n k (Z)ni(Z)H kt i) . 


( 2 . 20 ) 
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In this case, the likelihood function f(Z,A;8) has a tractable form and 
one can show Lemma 2.3 holds provided n 1//2 p n j logn —> oo. The stricter 
condition on the degree density ensures even with node degree variations 
(in the worst case E(-Ajj | Z, uS) ~ p n /n) there still exist enough edges for 
parameter estimation. Similar arguments apply to show that the criterion 
(2.18) is asymptotically consistent for this DCSBM. As the derivation is 
largely similar to the regular SBM case, we provide a proof sketch in the 
supplementary material. 

2.6. Likelihood approximations. In practice, direct computations of the 
likelihood function g(A ; 6) and its supremum involve an exponential number 
of summands and quickly become intractable as n grows. In this section we 
provide practical ways to approximate the likelihood and discuss conditions 
under which asymptotic consistency is preserved. 

Variational likelihood for regular SBM. Using the EM algorithm to op¬ 
timize over 6 requires computing the conditional distribution of Z given A, 
which is not factorizable in this case. Variational methods tackle the true 
conditional distribution fz\A-fi with the mean field approximation, thus sim¬ 
plifying the local optimization at each iteration. Under the regular SBM, 
the variational log likelihood J(q, 8 ; A) for a AT'-block model is defined as 

(2.21) J(q, 8 ; A) = -D KL {q\\f z]A . e ) + log g{A- 8), 

where q £ T>k> is any product distribution with q(z) = n”=i Qi( z i ) 5 1 < Zj < 
K'. The variational estimates is given by 

f?YY R = arg max max J(q,8;A ), 

eee K ,q€V K , 

which can be optimized using the EM algorithm in Daudin, Picard and 
Robin (2008). Also we note that J(q,8',A) simplifies to 


n K' 

J(q,8-A) =EE &(&)(-log&(*;) + log 7r(fc)) 

i =1 k =1 
_ K' 

+ ^2^2 9i( fc )9j(0 ( A ij !og H k} i + (1 - Aij) log(l - H k j)) 

i<j k,l =1 


and hence can be easily evaluated. 
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We can replace the likelihood in (2.18) by the variational log likelihood J 
without changing its asymptotic performance. More precisely, the criterion 
with variational approximation 

(2.22) /3 var (K')= sup sup J{q,6-,A) - X- K ^ K ' + 1 \ logn 

eee K , qev K , 2 

is still asymptotically consistent. Noting that 

(i) J(q, 6; A) < log g(A;6) for any q € V K r, 

(ii) sup 0e0x swp qeT>K J(q, <9; A) -sup 0e0/f logg{A] 6) = O p { 1) as shown in 
Bickel et al. (2013), 

it can be easily verified that (2.16) and (2.17) still hold. Although (ii) applies 
to the global optimum of J{q , 0 ; A) which may not be achieved by the EM 
algorithm, we note that it can be relaxed to accommodate for the difference 
in practice. Provided the difference between the local optimum found by the 
algorithm and the global optimum is bounded by op (n log n), asymptotic 
consistency still holds. 

Label estimation. The computation time of variational likelihood grows 
quickly with network size and becomes more complicated for degree cor¬ 
rected models. On the other hand, a number of algorithms are available for 
estimating the latent labels in a computationally efficient way under both 
regular SBM and DCSBM. Typically these algorithms require specifying 
the block number, hence let Z(K') be the estimated labels corresponding 
to block number K'. Then fML(Z(K'), A) = supg e0 ^ ; f(Z(K'),A]0) is the 
maximum complete likelihood by plugging in the estimated labels. We as¬ 
sume that the estimation algorithm is strongly consistent with the same 
convergence rate as in Theorem 1 of Bickel and Chen (2009). 

Assumption 2.11. There exists a sequence b n —> oo such that 

(2.23) P (Z(K) / r(Z)) = 0(n~ bn ), 
where t is a permutation on [K]. 

Such a convergence rate can be achieved by e.g., profile maximum like¬ 
lihood (Bickel and Chen, 2009). For computational efficiency we will use 
the pseudo-likelihood algorithm developed by Arnini et al. (2013), which is 
available for both regular SBM and DCSBM. Weak consistency for label 
estimation was shown in Arnini et al. (2013). However, the plug-in estimates 
of the block model parameters are still consistent. Under Assumption 2.11, 
observing that 
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1- fML(Z(K'),A) < sup 0ee#f , log g(A;0)] 

2. log f M L(Z(K),A) = sup 0eeK logg(A-,6)(l + o P (l)), 


it is easy to see (2.16) and (2.17) still hold, and the criterion 


(2.24) P ml {K’) = log /ml (Z(K'),A) - A • R ' (A ' + ^ nlogn 


is asymptotically consistent. 

In the next section, we use simulated data to demonstrate how the cri¬ 
terion performs in practice. We approximate the likelihood using the varia¬ 
tional EM algorithm for regular SBM and the pseudo-likelihood algorithm 
for DCSBM. 

3. Simulations. 

3.1. Goodness of fit. We first examined how well the normal limiting dis¬ 
tribution approximated the empirical distribution of the statistic in the case 
of underfitting. Figure 2 plots the distribution of u~^^ 2 Lk,k-i for n = 200 
and n = 500 obtained from 200 replications for the following two scenarios: 



The log likelihoods are approximated by the variational EM algorithm ini¬ 
tialized by regularized spectral clustering (Joseph and Yu, 2013). The solid 
curves are normal densities with mean fj,2(0*) and <r(0*) given in Theorem 
2.4. Even though the O(n) term in y,2{0*) diminishes asymptotically for p n 
going to 0 slowly, we found it essential to correct for the bias in the fi¬ 
nite sample regimes above. In both cases, the convergence to the Gaussian 
shape appears faster than the convergence to the mean, and a bias exists 
for n = 200. When the network size reaches 500, the empirical distribu¬ 
tions are well approximated by their limiting distribution. We note that the 
bias should not have an adverse effect on model selection since it is in the 
direction away from zero, making it easier to separate the two models. 


3.2. Selection of tuning parameter. Before we investigate the finite sam¬ 
ple performance of our model selection criterion, we note that (2.18) involves 
a tuning parameter A. We next propose a heuristic scheme for selecting A. 



Density Density 
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(c) (d) 

Fig 2: Empirical distributions of n~ 3 ^ 2 Lx,K-i for (a), (b) K=2, it* and H* 
as described in scenario (a); (c), (d) K=3, n* and H* as described in scenario 
(b). n = 200 in (a) and (c); n = 500 in (b) and (d). The solid curves are 
normal densities with mean 112 ( 6 *) and cr( 6 *) as given in Theorem 2.4. 


Similar to implementing BIC or AlC-type criteria, a maximum block num¬ 
ber JF max to be fitted needs to be chosen first. Then A is chosen by the 
following algorithm: 
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1. For each choice of A, compute /3(K') for K' = 1,, K max . 

2. Normalize (—/3(1),..., — /3(A' max )) into a probability vector (w ±,..., rcA' max 
so that they sum to 1. 

3. Choose A that maximizes the entropy — w k log (wk). 

4. If ties exist, choose the largest A. 

Here f3(K r ) can be computed from either the variational likelihood (2.22) or 
the plug-in maximum likelihood (2.24). Heuristically, this algorithm chooses 
a A that maximizes the “peakedness” of the profile of the penalized likeli¬ 
hoods (/3(1),..., /3(K max )) and hence the amount of signal contained in it. In 
the following sections, A was chosen in the interval [0, 0.3] with an increment 
of 1 X 10~ 3 ; /\ max = 10 for all simulated data. 

3.3. Performance comparison with other methods. To see how our crite¬ 
rion (denoted plh for penalized likelihood) performs against other existing 
model selection methods, we compare its success rate with variational Bayes 
(Latouche, Birmele and Ambroise (2012), denoted vb) and the 3-fold net¬ 
work cross validation method in Chen and Lei (2014) (denoted ncv). Since 
vb is only available for regular SBM, only ncv is included for DCSBM. plh 
is computed via either the variational EM for regular SBM or the pseudo 
likelihood algorithm (Arnini et ah, 2013) for DCSBM. 

In Figure 3 shows the average success rates of all three methods for data 
generated from regular SBM. 50 networks of size 500 were generated for each 
parameter set with K = 2, 3,4, H* = pS *, and p € {0.02,0.04,..., 0.1}. The 
average degrees of these networks range from around 12 to 75. In general, 
the success rate of each method decreases as the networks become sparser 
and K increases, since the task of fitting also becomes harder. Overall plh 
outperforms the other two methods. 

Next we simulated networks from DCSBM. To test if our method also 
works for more general DCSBM parameter settings and not limited by the 
specific Dirichlet prior assumption on ui, we generated degree parameters 
uj from a Unif(0.2, 1) distribution and further normalized them so that 
Ylz =k u i = n k(Z). We set H* = pS* as in the previous case with varying 
p. Binary adjacency matrices were generated even though our model in Sec¬ 
tion 2.5 is Poisson. As the simulation confirms, the approximation works well 
when networks are sparse. Table 1 shows the average success rates of plh 
(calculated using the pseudo likelihood algorithm) and nvc for 50 networks 
of size 800 for each parameter set. Overall the problem is harder in this case 
than regular SBM as the inclusion of degree parameters induces more spar¬ 
sity in some regions of the networks, plh shows a significant improvement 
over ncv in almost all cases. 
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p p p 


(a) (b) (c) 

Fig 3: Comparison of the success rates of the penalized likelihood (plh) 
with variational Bayes (vb) and network cross validation (ncv). For every 
parameter setting, 50 networks were simulated from regular SBM with (a) 
K = 2, 7T = (0.4, 0.6); (b) K = 3, vr = (0.3,0.3, 0.4); (c) K = 4, tt, = 0.25 for 
all i. In all the cases, H* = pS*, where p G {0.02, 0.04,..., 0.1}, the diagonal 
elements of S* equal 2 and the off diagonal elements equal 1. 


K = 2 K = 3 K = 4 

p 0.02 0.04 0.08 0.02 0.04 0.08 0.02 0.04 0.08 

plh 0.88 0.96 1 0.08 0.66 1 0.12 0.64 0.98 

ncv 0 0.26 1 0 0 0.54 0 0 0 

Table 1 

Comparison of the success rates of the penalized likelihood (plh) with network cross 
validation (ncv). For every parameter setting, 50 networks were simulated from the 
DCSBM with (a) K = 2, n — (0.4,0.6); (b) K = 3, n = (0.3, 0.3,0.4); (c) K = 4, 
m = 0.25 for all i. In all the cases, H* = pS*, where p £ {0.02, 0.04, ..., 0.1}, the 
diagonal elements of S* equal 2 and the off diagonal elements equal 1. 


4. Real world networks. In this section, we examine the performance 
of our method on real world networks. We set K majX = 30 for the Facebook 
networks and K max = 15 for the others. We first implemented our method 
along with vb and ncv on nine Facebook ego networks, collected and labeled 
by Leskovec and Mcauley (2012). An ego network is created by extracting 
subgraphs formed on the neighbors of a central (ego) node. Any isolated 
node was removed before analysis. Fitting regular SBM to these networks, 
variational EM was used to compute plh. The actual sizes of the networks 
and the number of communities selected by the three methods are shown 
in Table 2. The third row of the table shows the number of friend circles in 
every network with some individuals belonging to multiple circles, but not 
every individual possesses a circle label. These circle numbers give partial 
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truth on how many communities there are in the networks. Overall plh gives 
estimates closer to the circle numbers when the network is reasonably large 
and the number of circles is moderate, vb performs better on networks with a 
large number of circles but also overfits in a few cases, ncv tends to produce 
smaller community numbers. 


# Non-isolated vertices 

333 

1034 

224 

150 

168 

61 

786 

534 

52 

Average degree 

15 

52 

29 

23 

20 

9 

36 

18 

6 

# Circles 

24 

9 

14 

7 

13 

13 

17 

32 

17 

plh 

6 

7 

6 

4 

6 

6 

9 

9 

6 

vb 

11 

24 

16 

9 

11 

6 

25 

23 

6 

ncv 

3 

6 

4 

2 

4 

2 

2 

2 

3 


Table 2 

Facebook ego networks and the number of communities selected by the three methods. 


We also implemented these methods on the political book network (New¬ 
man, 2006b), which consists of 105 books and their edges representing co¬ 
purchase information from Amazon. Again we treated this as a regular SBM 
and used variational EM to fit plh. Figure 4 (a) shows the manual labeling of 
the books based on their political orientations being either “conservative”, 
“liberal” or “neutral”. As shown in (b), plh estimated I\ = 6 and essentially 
splits each of the communities in (a) into two, suggesting the presence of 
sub-communities with more uniform degree distributions, vb found 4 com¬ 
munities but merged two communities in (a) into one. ncv selected K = 2 
and also merged two clusters in (a). 



Fig 4: Communities in 105 political books based on (a) manually curated 
ground truth; (b) penalized likelihood; (c) vb; (d) ncv. 


Finally we used the political blog network (Adamic and Glance, 2005) as 
an example of DCSBM. The network consists of blogs on US politics and 
their web links as edges. Based on whether a blog is “liberal” or “conserva- 
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tive”, the network is divided into two communities. As is commonly done in 
the literature, we considered only the largest connected component contain¬ 
ing a total of 1222 nodes. In this case, plh selected K = 4, splitting one of 
the communities into three as shown in Figure 5. ncv selected K = 2. On 
the other hand, we have also observed in simulations that ncv tends to have 
a bias toward lower K for DCSBM. 



(a) (b) 


Fig 5: Political blog network (a) manually labeled ground truth; (b) penal¬ 
ized likelihood. 


5. Discussion. In this paper, we have studied the problem of selecting 
the community number under both regular SBM and DCSBM, allowing the 
average degree to grow at a polylog rate and the true block number being 
fixed. We have shown the log likelihood ratio statistic has an asymptotic 
normal distribution when a smaller model with fewer blocks is specified. In 
the case of misspecifying a larger model, we have obtained the convergence 
rate for the statistic. Combining these results we arrive at a likelihood-based 
model selection criterion that is asymptotically consistent. For finite-sized 
networks, we have further refined the bound for the statistic in the over¬ 
fitting case under reasonable assumptions to correct for the possibility of 
over-penalizing. Our method shows better performance than vb and ncv on 
simulated data and produces sensible results on a range of real world net¬ 
works. We also note that vb is only available for regular SBM and ncv tends 
to be highly varying from run to run due to its use of random partitions. 

There are a number of open problems for future work, (i) It would be 
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interesting to investigate whether the results can be extended to other block 
model variants, such as overlapping SBM (Airoldi et al., 2008; Ball, Karrer 
and Newman, 2011). (ii) We have performed our analysis with fixed block 
number as the number of nodes tends to infinity. However, in practice the 
number of communities is also likely to grow as a network expands (Choi, 
Wolfe and Airoldi, 2012), especially when we view block models as histogram 
approximations for more general models (Bickel and Chen, 2009; Wolfe and 
Olhede, 2013). Peixoto (2013) has provided some analysis on the maximum 
number of blocks detectable for a given SBM graph with fixed labels. In gen¬ 
eral as more time-course network data become available in biology, social 
science, and many other domains, incorporating dynamic features of com¬ 
munity structures into network modeling will remain an interesting direction 
to explore. 

APPENDIX A: PROOFS OF LEMMAS AND THEOREMS 

In this section, we prove all the lemmas and theorems in the main pa¬ 
per. Denote fx n = n 2 p n , the total number of edges L = Y^i=i J2J=i+i Ai,j> 
and N(z) = (nk,i(z))i<k,i<K'■ For two sets of labels 2 and y, \z — y\ = 
7^ Vi)- || • ||oo denotes the maximum norm of a matrix. We abbrevi¬ 
ate R(z, Z)S*R(z , Z) T as RS*R T (z). C, C\, ... are constants which might be 
different at each occurrence. The following concentration inequalities bound 
the variations in A and will be used throughout the section. 

Lemma A.l. Suppose z £ [K'] n and define X(z) = 0(z)//x n —RS*R T (z). 
For e < 3, 

(A.l) p( max ||X(z)|| 00 > e\ < 2{K') n+2 exp {-C^S*)^) . 

\ze[K , \ ri j 

Let y e [K'] n be a fixed set of labels, then for e < Sm/n, 

P f max \\X(z)-X(y)\\ 00 > e 

\z:\z-y\<m 

(A.2) <2f n W') m+2 exp (-OfiSn ne2lIn ) . 

\mj \ m J 

C\(S*) and C- 2 (S*) are constants depending only on S*. 


PROOF. The proof follows from Bickel and Chen (2009) with minor mod¬ 
ifications for general AT'-block models and correcting for the zero diagonal 
in A. O 



LIKELIHOOD-BASED MODEL SELECTION FOR SBMS 


21 


Recall that 


71 (x) = xlogx + (1 — x) log(l — x), 

72 (x) = xlogx — x. 


Define i^(M, t), i = 1 , 2 , as 
(A. 3 ) 


A" 


Fi(M,t ) = ^ tfe.iTi 

k,l=l 


Mk, 

tk,l J ' 


Then the log of the complete likelihood can be expressed as 


K 2 

n 


sup log f(z, A] 6 ) = n y 2 a(n k (z)/n) + —F 1 ( 0 (z)/n 2 ,N(z)/n 2 ) 
e&e K , '■ 


k =1 


(A. 4 ) 


where a(x) = xlog(x). Noting the first term is of smaller order compared to 
the second term, and the conditional expectation of the argument in 71 given 
Z is [RH*R T {z)]k,i/[Rll T R T (z)} k ,i and [RS*R T {z)} k ,i/[RU T R T {z)\ k ,t for 
72 (up to a diagonal difference) with fluctuation bounded by Lemma A.l, 
we will focus on analyzing the conditional expectation 


(A. 5 ) 

G 1 (R(z),S*) 

(A.6) 

G 2 (R(z),H*) 


A' 


k,l =1 


/ [RH*R T (z)] kil \ 
\[Rll T R T (z)} k J 


K’ 

Y [R llT R T (z)\k,H2 


k,l =1 


/ [RS*R T (z)] k ,i \ 
\[Rll T R T (z) } k J 


for p n = D(l), 


for p n -A 0. 


The following lemma shows in the case of underfitting a (K — l)-block model, 
to maximize Gi over different configurations of R(z, Z) with given Z. it 
suffices to consider the merging scheme described in Section 2.2 by combining 
two existing blocks in Z. 


Lemma A. 2 . Given the true labels Z with block proportions p = n(Z)/n, 
maximizing the function G\(R(z), H*) over R achieves its maximum in the 
label set 


{z G [K — l] n | there exists r such that t(z) = U a j,{Z), 1 < a < b < K ,} 
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where U a ,b merges Z{ with labels a and b. 

Furthermore, suppose zq gives the unique maximum (up to permutation 
t), for all R such that R >0,R T l=p, 


(A.7) 


5Gi((l 


e)R(z 0 ) + eR,H*) 
de 


<-C < 0 

e=0+ 


for p n = fi(l). The same conclusions hold for G 2 (R{z), S*). 


Proof. Treating R as a (K — 1) x A'-dimensional vector, it is easy to 
check Gi(-,H*) is a convex function. Furthermore, since R > 0, R T 1 = p, 
the domain is part of a convex polyhedron Pp> = {R £ | R > 

0, R t 1 = p}. Therefore the maximum is attained at the vertices of Pn, that 
is R vert such that for every a, exactly one RZ e f t , (1 < k < K — 1) is nonzero. 
This is equivalent to assigning all Z{ £ [K] with the same label into one 
group with a new label in [K — 1]. Let u : [K ] — > [K — 1] be the function 
specified by R vert , then 


(A.8) 


Gl (R^t jH *) = Y^ PoPbT 1 

k,l a£u~ 1 (k), 
b£u~ 1 (/) 


/Saeu _1 (fc), ^afiPaPb 
b£u~ 1 (l) 

VciVb 


Note that there exists at least one l £ [K — 1] such that |{n _1 (/)}| > 1, and 
k £ [K — 1]} forms a partition on [K]. By strict convexity of 71 and 
identifiability of H*, to maximize G\ it suffices to consider merging two of 
the labels in [. K ] and mapping the other labels to the remaining labels in 
\K — 1] in a one-to-one relationship. 

The second part of the lemma holds since it is easy to see when the 
maximum is unique, the derivative of the G\ at the optimal vertex is bounded 
away from 0 in all directions. The same arguments apply to G2. D 


Noting that when p = n*, Gi evaluated at R(U at b{Z)) is equal to D* 
defined in (2.5), it is easy to see Assumptions 2.1 and 2.2 guarantees the 
maximum is unique. We will now prove Lemma 2.3. 


Proof of Lemma 2.3. Taking the log of the complete likelihood, 
sup log f(z,A;6) 

0ee K -i 

k- 1 2 

=n ^2 OL(n k {z)/n) + yFi (O(z)/n 2 , N(z)/n 2 ) . 
k =1 






LIKELIHOOD-BASED MODEL SELECTION FOR SBMS 


23 


(A.9) 

By concentration of p k , it suffices to consider {||p — 7 r*||oo < 77 }, where r/ is 
small enough that Z l remains the unique maximizer of G\(R(z), H*) and 
G 2 (R(z), S*), and distribution conditional on Z. 

Using techniques similar to Bickel et al. (2013), we prove this by consid¬ 
ering 2 far away from Z' and close to Z' (up to permutation r). Let d n be 
a sequence converging to 0 slowly. Define 

h n = {ze [K - 1]- : G\{R{z),H*) - Gi(i2(Z'), H*) < -5 n }. 

First by (A.l) in Lemma A.l, for e n —> 0 slowly, 

\Fi (0(z)/n 2 , N(z)/n 2 ) — G\(R(z), H*)\ 

<C • | O kjl (z)/n 2 - (RH*R T (z)) k}l \ + O^" 1 ) 

k,l 

(A.10) =op(e n ) 

since 71 is Lipschitz on any interval bounded away from 0 and 1 and min H* = 
fi(l). For z € h n and p n = fi(l), 

V sup e log /UA; 0 ) < sup f (z', A\6){K - i)^ e O{n)+o P {n\ n )-n 2 S n 
zei Sn e&e K-i ee & K -1 

(A. 11 ) = sup f(Z',A-,0)o P ( 1) 

0S0K-1 

choosing 5 n —> 0 slowly enough such that 5 n /e n -A 00 . Similarly for p n —>• 0, 
define 


Js n = {ze [K - If : G 2 (R(z), S*) - G 2 (R(Z'),S*) < -S n }. 

Note that in this case, for e n —> 0 slowly, 

F 1 (0(z)/n 2 ,N(z)/n 2 ) 

=2 log p n L/n 2 + p n F 2 ( 0(z)/fi n ,N(z)/n 2 ) +0 P (p 2 1 ) 

(A. 12 ) =21og p n Llr? + p n G 2 (R(z), S*) + op(p n e n ) + Op(p^), 

by (A.l) and the fact that 72 is Lipschitz on any interval bounded away 
from 0 and 1 and min S* b > 0. Then for z G Js n , 

Y sup e log/(2 ’ A;0) 
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< sup f(^z' 1 A\9){K — \) n e 0 ^ +0p ^ nPn ^ +0p ^ nen ^~ iJ ' n5 ' 

0S0K-1 

(A.13) = sup f(Z\A- 1 B)o P ( 1). 

oeO K _ i 


choosing e n —» 0, 5 n —> 0 slowly enough. 

For z Js n , \G 2 {R(z), S* *) — G 2 (R{Z'), S*)\ -a 0. Let z = min r \t(z) — 
Z'\. Since the maximum is unique up to r, || R(z) — R(Z’)\\ oa —> 0 and 

IE k a M*)/n) ~ J2k a (M z ’)/n)\ -> 0. 

By (A. 2), 


(A.14) 


max \\X(z) ~ X(Z')\\ 00 > e\z — Z'\/ri\ 
iS(Z') J 

<5Z P f ™ax ll x (") “ x< y z ')\\™ > e ~) 
— J \z:z=z.\z—Z'\=m 71 J 

m =1 x 1 7 

n 

< 2 (K - 1 ) K ~ l n m (K - l) m+2 exp (-C^) 


It follows for \z — Z'\ = m, z ^ 

= op(l)^. + ||i2S*i2 T (z) - RS*R T {Z') Hoc 
n 

TYl 

(A.15) > —(C + op(l)). 

n 

Observe ||0(Z / )//r n — RS*R(Z , )\\ 00 = op(l) by Lemma A.l, N(Z')/n 2 = 
ittl T i? T (Z') + o(l) on {||t? — 7T*||oo < 77}, and F2(■> ■) has continuous deriva¬ 
tive in the neighborhood of (O(Z’)/fj, n , N(Z')/n 2 ). Using (A.7) in Lemma 
A.2, 

aft (a-<)® +£ M,d-<)S?! +(t ) 

9e 

for ( M,t ) in the neighborhood of (O(Z’)//j, n , N(Z')/n 2 ). Hence 

*2 (0(f)/ /i n , N(z)/n 2 ) - F 2 (O(Z')/fi n , N(Z')/n 2 ) 

777 

(A.16) <-fi P (l) —. 

n 


< -n P (l) < 0 


— n+ 


O(z) 

o( z ') 

l^n 

Mn 


sup log/(z, A;0) - sup log f(Z',A;8) 
Og&k-i 0£&k -i 


We have 
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<n 


K -1 

Y a{n k (z)/n) 
k= 1 


a{n k (Z')/n) 


+ n 2 (. F l (0(z)/p ril N{z)/n 2 )-F 1 {0{Z')/g n ,N{Z')/n 2 )) 

Tn 

< ( 0(n) + o P (g n ) - flp(/i„)) — 

n 

= -«pW- 

n 


(A.17) 


using (A. 12) and (A. 16). We can conclude 


sup e 


log/(z,A;0) 


< sup /(Z 7 , A; 0) V (A - l)^- x n m (A - i)™ e -W>(M™A* 
ee©x-i m=i 

(A.18) = sup /(Z 7 , A;0)op(l) 

deOji -1 


The bounds (A.13) and (A.18) yield (2.7). The case for p n = 0(1) can be 
shown in a similar way. □ 

Now Theorem 2.4 follows by Taylor expansion. 


Proof of Theorem 2.4. First note that 


r . sup 0 e0 g(A; 9) sup e&e g{A;9) 

L K ,K- 1 = log- 777^ -log 


g(A; 9*) g(A;6 

g(A-,e) f(z,A-,e*) 


(A.19) 


96eL l0g lf(Z,A-6*) g{A-e *) J 

= sup lQ g f ri A ' A d L +Op(!) 
0e©x-i J{Z,A,V ) 


+ Op( 1) 


by a consequence of Theorem 1 and Lemma 3 in Bickel et al. (2013). Noting 
that sup eg 0 /(Z 7 , A; 0) is uniquely maximized at (omitting the argument 

Z) 


Tr a = — = 7T* + Op(n 1 / 2 ) for 1 < a < K — 2, tr'-i = 
n 

H a ,b = = H* b + Opiyfp^rT 1 ) for 1 < a < b < K - 

R a, 6 


n A -_i + n# 
n 


2 , 
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H a , K -1 = °:' K ~ l | a '' K = K, K -i + 0 P {p n n- 1 / 2 ) for 1 < a < K - 2, 


(A.20) 

Hk-i,k-i = 


n a ,K-i + n a ,K 

Ha=K-l '^2b=a Oa,l 


sr^K sr^K 

l^a=K -1 2-^6=a n a,b 


— H' k _ i )K _i + Op(p n n 1 /"), 


and Assumption 2.2 the merged S' is identifiable, we have 
suPesOif-i EzeS(Z') f( z > A ‘, e ) 


su P6»e6x-i 

Combined with Lemma 2.3 

g(A-e) 


= 1 + Op(l). 


t ^V og nz,A-p) 

VeI, log 7(W) 


+ Op(l). 


(A.21) 


We will check the expansion for the case p n —> 0; the case p n = J)(l) can 
be shown in the same way. 


n 3/2 Pn 1 sup log 


g{A-o) 


=n~ 3/2 Pn 1 SU P lQ g {If 'f.'nl + °p ( 1 ) 


=n _3 / 2 p“ 1 




K -1 , A-1A-1 

n 

a= 1 a=l fe=l 

K K 


K -1 / 

XI a (^“) + 2 X! X ( log J 


H, 


a,b 


-H, 


a,b 


+ n a ,b log(l - H a ,b) 


X^a lo g< - ^XX (°a,6log Y 

a=l a=l b=l \ 


H. 


a,b 


- H 


+ n afi log(l - H* b ) } + o P ( 1) 


a,6 


=r) —1/2 1 


=n 


Pn 1 [ a ( 7r ft:-l + ^a) — a ( 7r X'-l) _ “(/Lx)] 


n 3/2 ^^ X (°a, 6 log 

(a,6)eX V 


if', . , M (1 - if* ) 1 - if' ,.A 

” ( “ W ‘) “Analog l+Ml). 


(1 - if' 

v 7/ 


u(a),u(b)' a,6 




a,6 


(A.22) 
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where X be the set of indices affected by the merge, 

X = {( a , b) E [K] 2 | K - 1 < a < K or K - 1 < b < K}. 

It is easy to see the expectation of this term is p~ ] y/np 2 , we have 

-3/2 -1 i 9(A'id) /— _i 

n ' p n sup log ' - Vnp n P 2 


2 n 3 / 2 p n 


£ 


(a,6)el L 


EE/ /I _ EE* \ 

2 ee * * *\1 £ 1 u(a),u(b)\ a,b) 

{O a ,b - n H a b n a TT b ) log ^ 

' u(a),u(b)' a,b 


+(n a ,6 - 7T«)log 


1 - H', A 

u(a),u(b) 


1 - H 


a,b 


+ Op(l) 


2n 3 / 2p K,fe-n 2 7r>^)di(a,6)+ o P (l), 


(a,6)el 


(A.23) =-^- ^ (n aj6 /n 2 -«)d 2 (a,^)+ op(l) 

(a,b)ex 

where di(a,b ) is defined in (2.8). (2.9) follows by the delta method. 

0 


Proof of Lemma 2.6. The proof follows using arguments similar to 
Lemma 2.3. Note that in this case Gi(R(z),H*) is maximized at any z E 
V K + with the value T, a ,bPaPbli{H* ab ) (or Y<a,bPo.Pbl 2 (S* afi ) for G 2 (R{z ), S*)). 

It suffices to discuss the case p n -A 0. Denote the optimal G* := Y2 a bPaPbl 2 {S* b ), 
define similarly to Lemma 2.3 

J Sn = {ze [K+} n : G 2 (R(z), S*)-G*< -U 

for 5 n —> 0 slowly enough. It is easy to see 

V SU P sup f( z o,A-,6)o P (l) 

zeJ Sn dee K+ ee0 J?+ 

for any zq E Vk+- 

Next note that treating R(z) as a vector, {R(z) \ z E V^+} is a subset of 
the union of some of the K + — K faces of the polyhedron P R . For every 2 ^ 

J s„, z i V K +, let z± be such that R(z±) := min R ^ Zo):ZoeV k+ \\R(z) - R(z 0 )\\ 2 . 

R(z) — R(z_ ]_) is perpendicular to the corresponding K* — K face. Further¬ 
more, this orthogonality implies the directional derivative of G 2 (-,S*) along 
the direction of R(z) — R(z±) is bounded away from 0. That is 

dG 2 ((1 — e)R(z±) + eR(z), S*) ^ _ 


e=0+ 
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for some universal positive constant C. Similar to (A.17), 


sup log f(z,A;9)~ sup log f(z±, A; 9) < -VL P {g n ) — 

eee K+ ee e K+ n 

sup f(z,A;9) < sup f(z±, A; 9) 

eee K+ o&e K+ 


where \z — z±\ = rn. We have 

E s )- i p 

z^J Sn ,z^V K + dee K+ 

n 

< V sup f(z,A-9) J2( k ~ i) m n m e -np^)f 

z&V K+ 6&& K+ m= 1 

=°p( 1 ) V sup 

z&v K+ e ' e0 /f+ 


Hence the claim follows. 


□ 


Proof of Theorem 2.7. First note 

su Peee x+ 9(' 4; °) 


l k,k+ — log ■ 


f(Z,A-9* 


+ Op( 1), 


where 


(A.24) 


, supege g{A; 9) sup 0g0 f(Z,A-9) 
10(5 /(Z,A;«-) 2 bg /(Z,A;«-) 

= O p { 1). 


Let D(-) be a diagonal matrix, upper bounding by the maximum, 

, sup 0e0 g(A; 9) 

° g f(Z,A-9 *) 

z , f(z,A-,9) + 

< max sup log + n log A 

2 eee K+ J[Z,A-,9*) 

2 

= max {Fi ( 0(z)/n 2 ,N(z)/n 2 ) - Fi (D(p)H*D(p),pp T )} + Op(n) 

2 

< max |Pi (0(z)/n 2 ,lV(z)/n 2 ) - Fi(RH*R t (z), Rll T R T (z)) \ 

z Zj 

2 

+ max [F!(Pi7*i7 T (z),Pll T i? T (^)) - Pi (P(p)H*P(p),pp T )] + 0 P (n) 
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<C^ n max 

Z 


o(zi 

Hn 


(A.25) 

=0 P (n*/V n /2 ) 


rs*r t 


+ Op(n) 


using (A.l) in Lemma A.l, and the fact that 

max F 1 (RH*R T (z),Rll T R T (z )) = F 1 ( D{p)H*D(p),pp T ) . 

z£[K+] n 


□ 


Next we prove Theorem 2.10. 

Proof of Theorem 2.10. To upper bound L KK +, by Lemma 2.6, it 
suffices to consider 

max sup log f(z,A;6)— sup log g( A; 9) + 0(n) 
zeV K+ e&e K+ e&e K 

= max sup log f(z, A; 0) — log f(Z, A; 6*) + 0(n). 
zeV K+ e&e K+ 

It follows from the definition of V^+ there exists a surjective function h : 
[A' + ] -A [K] describing the block assignments in R(z,Z). We have 

max sup log f(z, A; 6) — log f(Z, A; 9*) + 0(n) 
z&v K + eee K+ 

k+ k / 

= max a (rik(z) / n) — n > a > rik(z)/n 

K k =1 a=l \k&h-i(a) 

(A.26) 

K+ K+ 

k =l l=i 

where Hk,i = Ok,i(z)/rik,i(z). The first part of the expression is nonpositive 
since a is superadditive. 

Taylor expanding (A.26) and using the fact that Hk,i~HF k ^ = Op(n -1 / 2 

for z G Mr+ uniformly, (A.26) is upper bounded by Op(n). □ 
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SUPPLEMENTARY MATERIAL 

Supplement to “Likelihood-based model selection for stochastic 
block models” 

(http://???). A proof sketch of how the main results in the paper can be 
extended to the DCSBM described in Section 2.5 is provided in the supple¬ 
ment. 
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SUPPLEMENT TO “LIKELIHOOD-BASED MODEL 
SELECTION FOR STOCHASTIC BLOCK MODELS” 

By Y.X. Rachel Wang* and Peter J. Bickel' 1 ' 

Department of Statistics, Stanford University* 

Department of Statistics, University of California, Berkeley t 

We provide a sketch of a proof showing the main results hold for the 
DCSBM described in Section 2.5. For labels 2 £ [K'] n , define counts 


O a {z) = = a)Aij. 


We divide the sketch into three parts. 

i) Lemma 2.3 still holds. First note that the likelihood function f(z,A]9) is 
given by (up to a constant) 



where 1 ^ is the degree of node i and B(-) is the beta function. It follows then 
up to a constant, 


sup log f{z,A;6) 
eee K i 



+ O k ( z ) logn k (z) + ^log((n fc ( 2 ) - 1)!) 


k 


k 


log((O fc (z) + n k (z ) - 1)!) 


1 
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2 

=n ^2 u(n k (z)/n) + ^-F 2 (0(z)/n 2 , N(z)/n ) 
k 

Ok{z) + 

n k (z)n 

Conditional on the true labels and degree parameters (Z,u), define a new 
K' x it confusion matrix R c whose (k, a)-th entry is 

R k ,a = n ~ l = k, Zi = a), 

i 

Using appropriate Chernoff bounds, we can show O k .i(z)/p n concentrates 
around the expectation ( R c S*(R c ) T ) k j and a bound similar to (A.l) holds, 
that is, there exists a sequence e n -A 0 such that 

max ||A c ( 2 )|| 00 = o P (e n ), 
z&[K'] n 

where X c (z ) = 0(z)/p n — R c S* (R C ) T (z ). The same uniform bound holds 
for max z£ ^i|» ,ke[K’] ~ ( R c S*(R c ) T (z)l) k . Therefore the key term is 

still the one involving conditional expectations 

G 3 (R=(z),S‘) = G 2 (iT( 2 ),S*) - ( 1R ^^^ )llt ) 

recalling a(x ) = x log x. Noting that G 3 (-,S*) is convex and its domain 
{R c G R K xK | R c > 0, (R c ) t 1 = p} is again part of a convex polyhedron, 
the arguments in Lemma A.2 apply and we can conclude that 

1. When K’ = K, R C (Z) = diag(p ) gives the unique maximum. 

2. When K = K — 1, with an assumption similar to Assumption 2.1 a 
unique maximum is achieved at the configuration merging the correct 
pair of blocks. 

3. (A.7) holds for G 3 {R C ,S*). 

Therefore the first part of the proof of Lemma 2.3 still holds for z far away 
from Z’ (or Z for a it-block model). 

For \z — Z'\ = m, the convergence rates have to be adjusted for the degree 
parameters ui. Let s(z) = an d s z( z ) = Ylvzi^z'Xtf- First we 

note that for \z — y\ < m, the following concentration holds using Chernoff 
bounds, 

P(" max H-X’(^) — X(y)\\ 00 > e\ 

\z: z- y\- m / 


^ + O(logn) + constant. 


^(Ofc(z) + n k (z)) log 



(0.3) 


3 


< Y1 (^') 2 exp 

z:\z-y\=m 


Ce 2 p 2 


ns(z)p n + p 2 s 2 (z) 


Next noting that | R c (z) — R C (Z ')|i = Cs(z)/n, we have 
II X(z)-X(Z') 


max 


z^S(Z') \R c (z) - R c {Z')\i 
2 

) 

m >0 z:\z—Z'\=m 


> e 


- V V (Ii'r exp 


C 8 2 ( 2 » 2 


SE E (. K') 2 exp | — C min 

m >0 z:\z—Z'\—m 


ns(z)p n + p 2 s 2 (z) 

s(z)p n n 2 s 2 (z) 


n ’ s 2 (z) Ei^ 2 


(0.4) 


for n 1 / 2 p n / log n —>• oo, since E* ^ = Op(n) and s 2 (z)/s 2 (z) = ftp (run 1//2 (log n) *). 
Therefore 


0(z) 

O(Z') 

l^n 

Pn 


OO 

=o P (l)|i? c (z) - iT(Z')|i + ||i? c 5*(i? c ) i (z) - R c S*(R c y\Z 
(0.5) >|| R c (z) - R c (Z')\\i{C + op( 1)) = 0 P (m/n 3 / 2 ). 


The rest of the proof follows and Lemma 2.3 holds for both K — 1 and 
A'-block models. 

ii) Theorem 2-4 holds with different mean and variance terms. The max¬ 
imum likelihood estimates have the same convergence rates as in the case 
of standard SBM, similar expansions show Lk,k -i scaled the same way 
is asymptotically normal. Furthermore, as discussed in Section 2.2, the re¬ 
sult is generalizable to L KK ~ for K~ < K with similar assumptions and 
L k ,k~ < Tt P {n 2 p n ). 

iii) Lemma 2.6 follows by similar arguments to (A.31) and (A.32). Theo¬ 
rems 2.7 and 2.10 hold by similar expansions. 
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